Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration

Authors

Xuansong Li

Stephanie Strassel

Stephen Grimes

Safa Ismael

Xiaoyi Ma

Niyu Ge

Ann Bies

Nianwen Xue

Mohamed Maamouri

Abstract

Interest in syntactically annotated data for improving machine translation quality has spurred growing demand for parallel aligned treebank data. To meet this demand, the Linguistic Data Consortium (LDC) has created large-volume, multilingual, multi-level aligned treebank corpora by aligning and integrating existing treebank annotation resources. Such corpora are more useful when the alignment is further enriched with contextual and linguistic information. This paper details how we create these enriched parallel aligned corpora, addressing approaches, methodologies, theories, technologies, complications, and cross-lingual features.

Similar resources

Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC

This contribution describes an Arabic-English parallel word-aligned treebank corpus from the Linguistic Data Consortium that is currently under production. Herein we focus primarily on the efforts required to assemble the package and on instructions for using it. It was crucial that word alignment be performed on tokens produced during treebanking to ensure cohesion and greater utility of the corpus. ...

Parallel Aligned Treebanks at LDC: New Challenges Interfacing Existing Infrastructures

Parallel aligned treebanks (PAT) are linguistic corpora annotated with morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. They are valuable resources for improving machine translation (MT) quality. Recently, there has been an increasing demand for such data, especially for divergent language pairs. The Linguistic Data Consortium (LDC) and its aca...

Querying Both Parallel And Treebank Corpora: Evaluation Of A Corpus Query System

The last decade has seen a large increase in the number of available corpus query systems. Some of these are optimized for a particular kind of linguistic annotation (e.g., time-aligned, treebank, word-oriented, etc.). In this paper, we report on our own corpus query system, called Emdros. Emdros is very generic, and can be applied to almost any kind of linguistic annotation using almost any li...

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

Large Semantic Network Manual Annotation

This abstract describes a project aiming at manual annotation of the content of natural language utterances in parallel text corpora. The formalism used in this project is MultiNet – Multilayered Extended Semantic Network. The annotation should be incorporated into the Prague Dependency Treebank as a new annotation layer.

Publication date: 2010
